This dataset was taken from The Movie Database (TMDb) and contains information about roughly 10,000 movies released between 1960 and 2015.
Number of Rows: 10866
Number of columns: 20
Columns:
After taking a look at the dataset, several questions came to mind regarding some features and the relationships between them.
Question 1: Which genres are the most popular from year to year?
Question 2: Which movies made the most profit, and in which years were they released?
Question 3: What kinds of characteristics are associated with movies that have high revenues?
Question 4: Which movies have the highest budgets but low vote counts?
Question 5: Which movies have the lowest budgets but very high vote counts?
Question 6: Which movies flopped or topped the most in terms of profit and votes, and who were their cast/director?
Question 7: Is screen runtime related to the success or failure of a movie?
Question 8: What is the runtime of the most successful movies between 1960 and 2015?
Question 9: Which production companies released the most successful movies each year?
Question 10: Do movies with more profit have more popularity as well?
Question 11: What is the relationship between budget and the other features in the dataset?
Question 12: Which cast members participated in the most successful/flopped movies?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
#to plot visualisations inline within the notebook
# Upgrade pandas to use dataframe.explode() function.
#!pip install --upgrade pandas==0.25.0
# Load your data and print out a few lines. Perform operations to inspect data
df = pd.read_csv('tmdb-movies.csv', index_col='id')
df.head()
df.tail()
We can notice that some columns have values separated by '|':
list_1 = ['cast', 'director', 'keywords', 'genres', 'production_companies']
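To see how widespread the '|' separator is in these columns, a quick count can be sketched on a toy frame first (the column names match the dataset, but the sample values below are made up):

```python
import pandas as pd

# toy frame with made-up values; the real check would run on df[list_1]
toy = pd.DataFrame({"cast": ["A|B|C", "D"],
                    "genres": ["Action|Drama", "Comedy"]})

# count the cells in each multi-valued column that contain a '|' separator
counts = toy.apply(lambda s: s.str.contains("|", regex=False).sum())
print(int(counts["cast"]), int(counts["genres"]))  # 1 1
```

Running the same apply over df[list_1] would show how many rows in each column need splitting.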
Some information about the data:
df.info()
df.shape
The dataset has 10866 rows and 20 columns
df.describe()
We can notice that some movies have both budget and revenue equal to 0; these can be treated as NaN values.
l = df['original_title'].loc[(df['budget'] == 0) & (df['revenue'] == 0)].values
#select movie names where both the budget and revenue are equal to 0
l
Types of features :
df.dtypes
We can notice that release_date is an object and not a datetime type.
Checking for duplicate rows:
df.duplicated().value_counts()
Inspecting the duplicated rows:
df[df.duplicated(keep=False)]
Number of rows with missing data:
df.isnull().values.ravel().sum()
To check if we have missing values for each feature:
df.isnull().sum()
Checking for unique values :
df.nunique()
df['original_title'].unique() #names of unique movie titles
df['original_title'].nunique() #number of unique movie titles
I noticed that there are some columns we might not need in further analysis, such as imdb_id (since we already have an id column), homepage, tagline, overview, and keywords, since we're not building a movie recommender in this project.
df.columns
list_2=['imdb_id','homepage','tagline', 'keywords', 'overview' ]
df.drop(columns= list_2, inplace=True)
df.head(1)
Dropping duplicates:
df.drop_duplicates(inplace=True)
Checking for duplicates again:
df.duplicated().any()
Dropping null values:
df.dropna(inplace=True)
Checking for null values again:
df.isna().any()
I noticed in the first part of this project that the release date of each movie is stored as an object rather than in date format. So, we should change it:
df['release_date'].dtype
release_date = pd.to_datetime(df['release_date'])
release_date
I encountered a problem: a release date written as, for example, '11/15/66' gets converted to '2066-11-15' instead of 1966, as noted in release_year.
#changing the release_date year part with the release_year values :
df['release_date'] = df.apply(lambda x: x.release_date[:-2] + str(x.release_year), axis=1)
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date']
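The century rollover can be reproduced in isolation. This sketch uses a made-up date to show why a two-digit '66' parses as 2066 (the %y convention maps 00-68 to 20xx and 69-99 to 19xx), and how splicing in the known four-digit year fixes it:

```python
import pandas as pd

# two-digit years 00-68 map to 20xx and 69-99 to 19xx under %y
parsed = pd.to_datetime("11/15/66", format="%m/%d/%y")
print(parsed.year)  # 2066, not 1966

# fix: drop the two-digit year and splice in the known four-digit year
release_year = 1966  # would come from the release_year column
fixed = pd.to_datetime("11/15/66"[:-2] + str(release_year))
print(fixed.year)  # 1966
```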
As noted in the first part of the notebook, some of the budget and revenue values are zeros:
df.loc[(df['budget'] == 0) & (df['revenue'] == 0)]
df.loc[df['revenue'] == 0]
df.loc[df['budget'] == 0]
We notice that some movies have a revenue but a zero budget, and vice versa, which is odd and unrealistic. We should replace each zero budget with the mean budget of its release year to make it more realistic, and likewise for revenue.
Checking how many zero values are there in each feature :
df[['budget', 'revenue']].apply(lambda x: x == 0).sum()
We first have to check whether dropping the zero budget and revenue values would affect our data and the information carried by each feature:
def entropy(Y):
    """
    Shannon entropy of the value distribution of Y.
    """
    unique, count = np.unique(Y, return_counts=True, axis=0)
    prob = count / len(Y)
    en = np.sum((-1) * prob * np.log2(prob))
    return en
entropy(df['budget'])
entropy(df['revenue'])
df_cleaned = df[(df['budget'] != 0) & (df['revenue'] != 0)]
entropy(df_cleaned['budget'])
entropy(df_cleaned['revenue'])
The increase in Shannon entropy is an indication of information loss: we don't need to drop all the rows containing zeros in the budget and revenue columns.
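The direction of this effect can be sanity-checked on synthetic data. The sketch below (made-up values) re-implements the entropy function so the snippet is self-contained, and shows that removing a dominant zero class raises the entropy of what remains:

```python
import numpy as np

def entropy(Y):
    """Shannon entropy of the value distribution of Y (same as above)."""
    _, count = np.unique(Y, return_counts=True, axis=0)
    prob = count / len(Y)
    return np.sum(-prob * np.log2(prob))

# made-up sample: 90 zeros plus ten distinct non-zero values
Y = np.array([0] * 90 + list(range(1, 11)))
h_all = entropy(Y)              # zeros dominate, so entropy is low (~0.80)
h_nonzero = entropy(Y[Y != 0])  # uniform over 10 values: log2(10) ~ 3.32

print(h_all < h_nonzero)  # True: dropping the zeros increases entropy
```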
We can also see this impact with boxplots:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,8))
ax[0].boxplot(df['budget'])
ax[1].boxplot(df_cleaned['budget'])
plt.suptitle("Information loss after dropping all zero values in budget");
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,8))
ax[0].boxplot(df['revenue'])
ax[1].boxplot(df_cleaned['revenue'])
plt.suptitle("Information loss after dropping all zero values in revenue");
Grouping the budget mean by year:
mean_budg = df.groupby('release_year')[['budget']].mean()
mean_budg.head(10)
Grouping mean revenue by year :
mean_rev = df.groupby('release_year')[['revenue']].mean()
mean_rev.head(10)
Replacing zero values of budget and revenue:
Replacing zero values in budget where revenue isn't zero:
zero_budg = df[(df.budget == 0) & (df.revenue != 0)] #budget equal to zero where revenue isn't zero
def replace_zeros(row: pd.Series, columns: list, means: pd.DataFrame):
    """
    Replace zero values in `columns` of a row with the corresponding
    year means looked up in `means` (a frame indexed by release_year).
    """
    if (row[columns] == 0).all():
        row[columns] = means.loc[row.release_year, columns]
    return row
zero_budg = zero_budg.apply(lambda x: replace_zeros(x, ['budget'], mean_budg),axis=1) #replacing budget with mean value
df[df.index.isin(zero_budg.index)] = zero_budg #replacing budget zero values in the dataset
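A more vectorised way to express this year-mean imputation is replace plus groupby-transform; here is a sketch on a toy frame (made-up numbers, not the real dataset). Note that, unlike mean_budg above, the mean here is computed over the non-zero values only:

```python
import numpy as np
import pandas as pd

# toy frame: one zero budget in 1995 (values are made up)
toy = pd.DataFrame({"release_year": [1995, 1995, 2000],
                    "budget": [0.0, 30.0, 50.0]})

# treat zeros as missing, then fill with the mean budget of the same year
toy["budget"] = toy["budget"].replace(0, np.nan)
toy["budget"] = toy.groupby("release_year")["budget"].transform(
    lambda s: s.fillna(s.mean()))
print(toy["budget"].tolist())  # [30.0, 30.0, 50.0]
```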
Check again whether we filled the zero budget values where revenue isn't zero:
df[(df.budget == 0) & (df.revenue != 0)]
Do the same for revenue, where the budget isn't zero but revenue equals zero:
zero_revenue = df[(df.revenue == 0) & (df.budget != 0)] #revenue zero values where the budget isn't zero
zero_revenue = zero_revenue.apply(lambda x: replace_zeros(x, ['revenue'], mean_rev),axis=1) #replacing zero value revenues with the mean
df[df.index.isin(zero_revenue.index)] = zero_revenue #replacing budget zero values in the dataset
Check again whether we filled the zero revenue values where the budget isn't zero:
df[(df.budget != 0) & (df.revenue == 0)]
I also noticed that runtime has some zero values, which is unrealistic. To tackle this, we should replace each zero runtime with the mean runtime of its release year.
zero_runtime = df[df['runtime'] == 0]
zero_runtime
Grouping the runtime by year:
mean_runtime = df.groupby('release_year')[['runtime']].mean()
mean_runtime.head(10)
zero_runtime = zero_runtime.apply(lambda x: replace_zeros(x, ['runtime'], mean_runtime), axis=1) #replacing zero runtimes with the year mean
df[df.index.isin(zero_runtime.index)] = zero_runtime #replacing runtime zero values in the dataset
Checking again for zero runtime:
df[df['runtime'] == 0]
I noticed that some rows list two directors at once, so I'm going to use the pandas explode function to create one row per director.
df[df.index == 14938].director
def explode(df: pd.DataFrame, col: str):
    """
    Split each '|'-separated value in `col` into one row per value.
    """
    df_new = df.copy()
    df_new[col] = df_new[col].str.split("|")
    return df_new.explode(col)
explode(df, 'director')
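On a toy frame (the names below are made up), the helper behaves like this:

```python
import pandas as pd

# toy frame with a '|'-separated director cell (names are made up)
toy = pd.DataFrame({"original_title": ["Movie A"],
                    "director": ["Jane Doe|John Roe"]})

# same idea as the explode helper above: split, then one row per value
toy["director"] = toy["director"].str.split("|")
exploded = toy.explode("director")
print(exploded["director"].tolist())  # ['Jane Doe', 'John Roe']
```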
Checking the statistical summary of the data:
df.describe()
First, I want to see the distribution of each feature:
df.hist(figsize=(12,12));
We can see that some variables are skewed to the right, such as budget and budget_adj, or to the left, such as release_year and release_date, while some, such as vote_average, are roughly normally distributed.
Distribution of popularity:
df.popularity.hist(figsize=(8, 6))
plt.title("Popularity Distribution", fontsize=18);
plt.xlabel("Popularity")
plt.ylabel("Number of Movies")
plt.show();
Most popular movie between 1960 and 2015:
df.loc[df['popularity'].idxmax()] #idxmax : returns the index of the first occurrence of maximum over requested column
Most popular movie: Jurassic World (2015)
#most popular movies :
most_pop = df.sort_values('popularity', axis=0, ascending=False).head(10)
most_pop
most_pop_by_year = most_pop.groupby(['release_year'])['genres'].value_counts()
most_pop_by_year
We can see that the most popular genres between 1960 and 2015 are Action, Adventure, Science Fiction and Thriller.
Profit of movies:
df['profit'] = df.revenue - df.budget
df.profit.head()
df.profit.describe()
Movies with the most profit (profit above the 75th percentile):
successful_movies = df[df.profit >= df.profit.quantile(0.75)]
successful_movies.sort_values('profit', axis=0, ascending=False).head(5)
successful_movies.groupby('release_year')['original_title'].value_counts()
df.revenue.describe()
Movies with the highest revenues (revenue above the 75th percentile):
high_movies = df[df.revenue >= df.revenue.quantile(0.75)]
high_movies.head(5)
high_movies.describe()
high_movies[['original_title', 'popularity', 'release_date', 'profit', 'budget', 'revenue', 'vote_average']]
m = df.loc[(df['budget'] > df['budget'].quantile(0.75)) & (df['vote_count'] < df['vote_count'].quantile(0.25))]
m[['original_title', 'release_date', 'budget', 'vote_count']]
m_2 = df.loc[(df['budget'] < df['budget'].quantile(0.25)) & (df['vote_count'] > df['vote_count'].quantile(0.75))]
m_2[['original_title', 'release_date', 'budget', 'vote_count']]
It seems there are no low-budget movies with very high vote counts.
#successful movies:
s = successful_movies[successful_movies['vote_count'] > successful_movies['vote_count'].quantile(0.75)] #boolean indexing keeps only rows above the 75th-percentile vote count, avoiding the NaN rows .where() would leave
s[['original_title', 'cast', 'director', 'release_date']].head(15)
#flopped movies (below the 75th-percentile profit):
flopped_movies = df[df.profit < df.profit.quantile(0.75)]
flopped_movies[['original_title', 'cast', 'director', 'release_date']].head(15)
import plotly.express as px
figure = px.scatter(data_frame = df,
x="runtime",
y="popularity",
size="runtime",
trendline="ols",
title = "Relationship Between Runtime and Popularity")
figure.show()
Runtime and popularity aren't correlated.
s.sort_values('popularity', axis=0, ascending=False).head(10).groupby(['runtime', 'release_year'])['original_title'].agg(', '.join)
most_pop['profit'] = df.profit
most_successful = successful_movies.sort_values('profit', axis=0, ascending=False).head(10)
most_successful
most_pop[['original_title', 'popularity', 'release_date', 'profit']]
Pie chart comparing the profit of successful movies vs. the profit of popular ones:
plt.pie([most_pop.profit.sum(), most_successful.profit.sum()],
labels=['Profit of Popular movies', 'Profit of successful movies'],
colors=['c', 'm'], autopct='%.1f%%',
textprops={'size': 'large'});
plt.title("Profit of successful movies vs. profit of popular ones")
plt.show();
Percentage of popular movies that also have the highest profit:
perc = (most_pop.index.isin(most_successful.index).sum()) / most_pop.shape[0] * 100
#most_pop.shape[0] : number of rows in most_pop
perc
We can notice that having the most profit doesn't mean a movie is also popular; only 20% of the most popular movies are also among the most profitable.
s[['original_title', 'production_companies', 'release_date']].head(15)
s.groupby('release_year')['production_companies'].sum()
pd.plotting.scatter_matrix(df, figsize=(15,15));
plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
plt.title("Relationship Between Budget & Budget after inflation")
sns.regplot(x="budget", y="budget_adj", data=df)
plt.show()
plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
plt.title("Relationship Between Budget & Revenue")
sns.regplot(x="budget", y="revenue", data=df)
plt.show()
plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
plt.title("Relationship Between Budget & Vote Count")
sns.regplot(x="budget", y="vote_count", data=df)
plt.show()
plt.figure(figsize=(8, 6))
plt.style.use('fivethirtyeight')
plt.title("Relationship Between Budget & Popularity")
sns.regplot(x="budget", y="popularity", data=df)
plt.show()
Budget is positively correlated with 'budget_adj', 'revenue', 'vote_count' and 'popularity'
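These visual impressions can be quantified with Pearson correlation. Since the real numbers depend on the CSV, this sketch uses synthetic budget/revenue data with a built-in linear relationship; on the actual dataset, the analogous check would be df[['budget', 'budget_adj', 'revenue', 'vote_count', 'popularity']].corr():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 2e8, 500)

# synthetic stand-in: revenue rises with budget plus noise (made-up model)
toy = pd.DataFrame({"budget": budget,
                    "revenue": 2.5 * budget + rng.normal(0, 3e7, 500)})

corr = toy.corr(method="pearson")
print(corr.loc["budget", "revenue"] > 0.9)  # True: strong positive correlation
```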
#successful movie:
list_3 = s.sort_values('popularity', axis=0, ascending=False).head(1).cast.values
list_3
Name of the movie:
s.sort_values('popularity', axis=0, ascending=False).head(1).original_title
#flopped movie:
list_4 = flopped_movies.sort_values('popularity', axis=0, ascending=True).head(1).cast.values
list_4
Name of the movie:
flopped_movies.sort_values('popularity', axis=0, ascending=True).head(1).original_title
In this project, I analysed a dataset of about 10,000 movies taken from The Movie Database (TMDb).
The cleaning part is what I found most interesting: I noticed null values and duplicates in the dataset, as well as zero values in some fields such as budget, revenue and runtime.
Some movies have both budget and revenue equal to 0, and these can be treated as NaN values. Normally we could drop such rows if doing so didn't affect the overall information in the dataset, but a check with Shannon entropy showed that we shouldn't drop all rows containing zeros in the budget and revenue columns, since the increase in entropy indicates information loss. Also, some movies have a revenue but a zero budget, and vice versa, which is unrealistic, so I replaced each zero budget with the mean budget of its release year, and likewise for revenue.
I also noticed that release_date is stored as an object rather than a datetime type, so I converted it. I encountered a problem where a release date written as, for example, '11/15/66' is converted to '2066-11-15' instead of 1966 as noted in release_year, so I replaced the year part of each release_date with the value from the release_year feature.
There were some columns I didn't need in further analysis: imdb_id (since we already have an id column), homepage, tagline, overview, and keywords, since we're not building a movie recommender in this project, so I dropped them.
Runtime also has some zero values, which is unrealistic; I replaced each zero runtime with the mean runtime of its release year.
Here are answers to some questions asked earlier:
The most popular genres between 1960 and 2015 are Action, Adventure, Science Fiction and Thriller.
The top 5 movies with the most profit were Jurassic World (2015), Mad Max: Fury Road (2015), Insurgent (2015), Star Wars: The Force Awakens (2015) and Furious 7 (2015).
The top 5 movies with the highest budgets but the lowest vote counts: The Swan Princess: A Royal Family Tale (2014), Red Sky (2014), Mao's Last Dancer (2009), Sinners and Saints (2010) and Double Wedding (2010).
The top 5 movies that flopped in terms of profit: Jupiter Ascending, Ex Machina, Tomorrowland, Mr. Holmes and Room.
Based on the graphs, we can assume that runtime and popularity aren't correlated; we can also see that having the most profit doesn't mean a movie is popular, as only 20% of the most popular movies are also among the most profitable.
The dataset has a lot of zero values, null values and badly formatted fields, which made it a bit hard to conduct a proper analysis; nevertheless, I found it an interesting problem to tackle.
Also, having the revenue column filled with values while the budget column has a lot of zeros, and vice versa, was unrealistic and a problem I found equally interesting.
Another limitation for me was conducting exploratory analysis with visualizations; I'm more familiar with wrangling the data.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])